Data Analysis and Visualization

Introduction

In this section, I use the dataframe, now clean and tidy after the wrangling process in the last section, to do some data analysis and present the results through visualization.

The questions I am interested in fall into two groups. Those in the first group look at a single variable; those in the second look at more than one, so we can examine the correlation between variables.

1. Which breeds of dog are most commonly seen on the page?

In [3]:
# The table shows the distribution of dog breeds in WeRateDogs

Golden Retriever is the most common dog on WeRateDogs, with nearly 160 appearances, far more than the second-placed Labrador Retriever (~110 appearances). Pembroke is in third place, only just ahead of Chihuahua, which comes next.

A slight digression: I wondered what a Pembroke is. I searched the name on Google and found that it actually means Pembroke Welsh Corgi! For me they should be the champions, because they are the cutest dogs in the world!
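The counting behind a table like this can be sketched with pandas `value_counts`. This is a minimal sketch, not the notebook's original cell: the toy rows are invented for illustration, and only the column name `breed_prediction` is taken from the output tables in this section.

```python
import pandas as pd

# Toy stand-in for the cleaned dataframe; these rows are made up.
df = pd.DataFrame({
    "breed_prediction": [
        "golden_retriever", "golden_retriever", "golden_retriever",
        "labrador_retriever", "labrador_retriever", "pembroke",
    ]
})

# Count how often each predicted breed appears, most frequent first.
breed_counts = df["breed_prediction"].value_counts()
print(breed_counts.head(10))
```

The same call on a name or stage column would produce the frequency tables used for the next two questions.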

2. Which names are common for dogs?

In [4]:
# This table shows the frequency of dog names in WeRateDogs

The result is quite even. Charlie and Lucy are the most common names (11 appearances each), but only slightly ahead of the others. Oliver and Cooper share second place with 10 appearances. This is an interesting question for me because I only know how people name their dogs in my own language, and it is good to learn how it is done in English.

3. Which stages of dog are most commonly seen on the page?

In [5]:
# This table shows the frequency of dog stages in WeRateDogs

Pupper is the stage seen most often on the page, with a much higher frequency than doggo in second place (though I can't really tell the difference between them).

4. Which tweet got the highest number of likes and retweets?

I try something new here to present the answer: I find the tweet and directly show its picture, with the tweet ID and the like/retweet count marked on the picture.

In [7]:
# This is the picture that received the most likes
In [8]:
# This is the picture that got the most retweets

I didn't expect both answers to be the same picture. The picture above got both the most retweets and the most likes of all the tweets in WeRateDogs. In my opinion, though, the picture is nothing special.

5. How does WeRateDogs rate the pictures?
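Finding the top tweet for each metric, and checking whether it is the same tweet, can be sketched with `idxmax`. The dataframe below is invented, and the column names `tweet_id`, `favorite_count` and `retweet_count` are assumptions about how the cleaned data is laid out.

```python
import pandas as pd

# Toy data; the IDs and counts are made up for illustration.
tweets = pd.DataFrame({
    "tweet_id": ["a1", "b2", "c3"],
    "favorite_count": [1200, 98000, 4300],
    "retweet_count": [300, 41000, 900],
})

# idxmax returns the index label of the first row holding the maximum value.
most_liked = tweets.loc[tweets["favorite_count"].idxmax()]
most_retweeted = tweets.loc[tweets["retweet_count"].idxmax()]

# True when one tweet tops both metrics.
same_tweet = most_liked["tweet_id"] == most_retweeted["tweet_id"]
print(most_liked["tweet_id"], most_retweeted["tweet_id"], same_tweet)
```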

In [9]:
# This table shows the distribution of ratings given by the editor of WeRateDogs

Because of the special rating system, designed to make the tweets look funny, most of the ratings are equal to or higher than 10. 12 is the most common rating, followed by 10 and 11. The average rating given by WeRateDogs is 10.67, higher than the denominator as expected.

6. How does WeRateDogs write its tweets?
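The distribution, the most common rating and the average can be computed along these lines. This is only a sketch on made-up numbers, so its results do not match the figures above, and the series name `rating_numerator` is an assumption.

```python
import pandas as pd

# Made-up ratings (numerators out of 10) for illustration only.
ratings = pd.Series([12, 12, 12, 10, 10, 11, 11, 13, 9], name="rating_numerator")

# Frequency of each rating, ordered by the rating value itself.
rating_counts = ratings.value_counts().sort_index()

# mode() returns all most-frequent values; take the first one.
most_common_rating = ratings.mode().iloc[0]
average_rating = ratings.mean()

print(rating_counts)
print(most_common_rating, round(average_rating, 2))
```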

In [10]:
# I create a word cloud to see the words they use frequently

Not surprisingly, this is a dog rating account, so words like "pupper", "dog" and "pup" are very common. Words like "meet", "love", "hello" and "happy" are also used a lot, which suggests that the editor wants to make the page warm and cute. The funniest thing for me is that "af" appears so often in their tweets; that made me laugh when I saw the word cloud.

7. Which breeds of dog get the highest ratings from the editor?
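A word cloud sizes each word by how often it occurs, so the underlying numbers are plain word frequencies. The sketch below swaps the word cloud for a simple count using only the standard library; the tweet texts and the stopword list are invented for illustration.

```python
import re
from collections import Counter

# Invented tweet texts in the style of the account.
tweet_texts = [
    "This is Charlie. He is a happy pupper. 12/10",
    "Meet Lucy. She is a good dog. 11/10 would pet",
    "Say hello to Cooper. Fluffy af pupper. 13/10",
]

# A tiny, hand-picked stopword list (an assumption, not a standard one).
stopwords = {"this", "is", "he", "she", "a", "to", "would", "say"}

words = []
for text in tweet_texts:
    # Keep alphabetic tokens only, lowercased, so "Pupper" and "pupper" match.
    for token in re.findall(r"[a-z]+", text.lower()):
        if token not in stopwords:
            words.append(token)

word_counts = Counter(words)
print(word_counts.most_common(5))
```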

In [11]:
# The table shows the average rating of different breeds of dog
# Get both mean and count, because the more appearances a breed has, the more representative its mean is
Out[11]:
                                 mean  count
breed_prediction
bouvier_des_flandres        13.000000      1
saluki                      12.500000      4
briard                      12.333333      3
tibetan_mastiff             12.250000      4
border_terrier              12.142857      7
silky_terrier               12.000000      1
standard_schnauzer          12.000000      1
gordon_setter               11.750000      4
irish_setter                11.750000      4
samoyed                     11.690476     42
golden_retriever            11.560510    157
giant_schnauzer             11.500000      4
wire-haired_fox_terrier     11.500000      2
australian_terrier          11.500000      2
great_pyrenees              11.466667     15
norfolk_terrier             11.428571      7
chow                        11.416667     48
pembroke                    11.410526     95
eskimo_dog                  11.409091     22
doberman                    11.333333      9
greater_swiss_mountain_dog  11.333333      3
cocker_spaniel              11.333333     30
irish_water_spaniel         11.333333      3
leonberg                    11.333333      3
kelpie                      11.307692     13
siberian_husky              11.300000     20
bernese_mountain_dog        11.272727     11
clumber                     11.270000      1
labrador_retriever          11.194444    108
french_bulldog              11.193548     31

Bouvier des Flandres has the highest average rating of all the breeds, but it appears only once. Among breeds rated more than 5 times, Border Terrier has the highest average rating, 12.14. By the way, because I'm not very familiar with dogs, especially their English names, I wrote a simple function to show me a picture of a given breed. The dog I would like to see is the Border Terrier.
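A table with both `mean` and `count` per breed, plus the "rated more than a few times" filter, can be sketched with `groupby` and `agg`. The rows are invented, `rating_numerator` is an assumed column name, and `min_count` is a hypothetical threshold chosen to suit the toy data.

```python
import pandas as pd

# Toy data; breeds and ratings are made up for illustration.
df = pd.DataFrame({
    "breed_prediction": ["saluki", "saluki", "pug", "pug", "pug", "briard"],
    "rating_numerator": [13, 12, 10, 11, 12, 13],
})

# Mean rating and number of appearances per breed, best mean first.
breed_ratings = (
    df.groupby("breed_prediction")["rating_numerator"]
      .agg(["mean", "count"])
      .sort_values("mean", ascending=False)
)

# Drop breeds with too few appearances for the mean to be representative.
min_count = 2  # hypothetical threshold for this toy example
reliable = breed_ratings[breed_ratings["count"] >= min_count]

print(breed_ratings)
print(reliable)
```

In the toy data, `briard` tops the unfiltered table on a single appearance, while `saluki` leads once the filter is applied, which mirrors the Bouvier des Flandres and Border Terrier situation described above.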

In [13]:
# This is Border Terrier

OMG! It seriously melts my heart!

8. Which breeds of dog get the highest favorite counts from the public?

In [14]:
# The table shows which breeds got the most likes on average from Twitter users
Out[14]:
                                 mean  count
breed_prediction
bedlington_terrier       22576.166667      6
saluki                   21722.500000      4
french_bulldog           18333.833333     30
bouvier_des_flandres     16090.000000      1
afghan_hound             15398.000000      3
black-and-tan_coonhound  15297.000000      2
flat-coated_retriever    15141.500000      8
irish_water_spaniel      14623.666667      3
leonberg                 13262.333333      3
whippet                  13194.727273     11

Bedlington Terrier is the most popular breed on WeRateDogs, with about 22,576 likes on average according to the table above. French Bulldog appears quite frequently (30 times in this table) and received over 18,000 likes on average, which suggests that people really love them.

Actually, I have no idea what a Bedlington Terrier looks like, because so many dogs are called "Terrier", so I want to see a picture of one.

In [15]:
# This is Bedlington Terrier...emmm...wait

Even though I know a Bedlington Terrier looks a bit like a sheep, I'm pretty sure this is a lamb. How did the image predictor judge it to be a Bedlington Terrier and give that the highest confidence level of all the possibilities? So in the next question I would like to look into the confidence level.

9. Which breeds of dog are hard for the image predictor to recognize?

In [16]:
# The table shows the average confidence level of different breeds
Out[16]:
                          mean  count
breed_prediction
irish_wolfhound       0.063078      1
bouvier_des_flandres  0.082610      1
scottish_deerhound    0.143519      4
norwich_terrier       0.246893      5
cairn                 0.262196      3
scotch_terrier        0.267979      1
standard_poodle       0.272770     11
bedlington_terrier    0.286043      6
mexican_hairless      0.300638      7
groenendael           0.302625      2

Irish Wolfhound and Bouvier des Flandres recorded confidence levels below 0.1, but each appears only once. Norwich Terrier has only a few appearances (5) and also recorded a low confidence level, just 0.25. Bedlington Terrier from the last question is also on the list, with a confidence level of 0.29. This time I would like to see a sample picture of a Norwich Terrier.

In [17]:
# This is Norwich Terrier...?

Is there any Norwich Terrier with such long legs? I'm not sure of this dog's breed, but I highly doubt that it is a Norwich Terrier. Now I can understand why the confidence level for predicting a Norwich Terrier is low.

10. Do pictures with a low confidence level receive more likes?

Another question about confidence level. My thinking was that a low confidence level might mean the picture is funny or unusual, and therefore more likely to receive a higher number of likes. So we are going to look at the correlation between confidence level and favorite count.

In [18]:
# The scatter chart shows the correlation between confidence level and favorite count
In [19]:
# calculate correlation value
Out[19]:
0.073482616497382303

0.07 is a very low positive correlation value, implying that there is no meaningful linear relationship between the two variables, confidence level and favorite count. A lower confidence level doesn't help a picture get more likes. My hypothesis is wrong.

11. Do pictures with higher ratings receive more likes?
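A correlation value like this is typically a Pearson coefficient, which pandas computes with `Series.corr` by default. The sketch below assumes that is the measure used here, runs on invented data with the assumed column names `confidence` and `favorite_count`, and cross-checks the result against the textbook formula.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
df = pd.DataFrame({
    "confidence": [0.1, 0.4, 0.5, 0.8, 0.9],
    "favorite_count": [5000, 3000, 9000, 4000, 7000],
})

x = df["confidence"]
y = df["favorite_count"]

# Series.corr uses the Pearson method unless told otherwise.
r = x.corr(y)

# Same quantity by hand: sum of products of deviations, divided by the
# square root of the product of the sums of squared deviations.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

print(round(r, 4), round(r_manual, 4))
```

Pearson's coefficient only measures linear association, so a value near zero rules out a straight-line trend but not every kind of relationship.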

In [20]:
# The scatter chart shows the correlation between rating and number of likes
In [21]:
# calculate correlation value
Out[21]:
0.37713333890489564

A correlation value of 0.38 indicates a moderate positive linear relationship between the two variables. That means the higher a picture is rated, the more likes it tends to receive from Twitter users.

Conclusion

To sum up, we got some very interesting insights from the data. We know which breeds of dog the editor tends to choose for posts and which words the editor tends to use when writing them. We also know what kinds of dog and picture tend to get more likes and retweets. Just as importantly, the analysis tested some of the hypotheses in my head, which turned out to be right in some cases and wrong in others. This was a fun project.